A lexical approach to text alignment using Intex

Authors

  • Duško Vitas
  • Cvetana Krstev
Abstract

This paper describes work in progress on the application of Intex to text alignment. Lexical resources incorporated in Intex and local grammars are used to identify those elements in translated text that usually represent literal translations of the original. The motivation for the experiment and the basic ideas of the algorithm are illustrated.

1. The motivation for the production of aligned resources

Compilation of Serbian parallel corpora at the Faculty of Mathematics (Belgrade) began with participation in the TELRI project and the production of the CD "East meets West A Compendium of Multilingual Resources", which contains, among other resources, Plato's Republic aligned in 17 languages and Orwell's 1984 aligned in 8 languages, in both cases including Serbian. We have continued to collect texts, mainly for a French-Serbian parallel corpus in which French is the source and Serbian the target language. The corpus consists predominantly of literary and newspaper texts.

Aligned literary texts include, among others, Voltaire's Candide, J. Verne's Le tour du monde en quatre-vingts jours, G. Flaubert's Bouvard et Pécuchet, and P. Louys's La Femme et le pantin. In some cases two translations into Serbian were obtained. These texts have been tagged to the sentence level and then aligned using either the Vanilla (Danielsson, 1997) or the MLAlign (Romary, 1995) alignment program. In both cases the output had to be checked by hand.

A certain number of contemporary texts translated into Serbian are being prepared for alignment as well. These are mainly texts related to philosophy, sociology, ethnology, and the sciences. However, for contemporary texts, unlike literary classics, it was usually difficult to obtain both the original and the translation in digital form.

The newspaper corpus consists mainly of the French monthly "Le Monde diplomatique" and its translation into Serbo-Croatian.
The acquisition of texts started in March 2001, when the publication of the Serbian translation began, and both source and target texts have been collected regularly since then: the source texts are downloaded from the "Le Monde diplomatique" site, while the translation is obtained directly from either the publisher or the translator. Part of these articles has been aligned; the rest is being preprocessed for alignment.

The motivation for the construction of this kind of multilingual resource is manifold:

1. It can be used as a powerful linguistic resource, for instance for language teaching.
2. In bilingual or multilingual lexicography, aligned texts can be a source of reliable data. For the production of traditional bilingual dictionaries, aligned corpora can provide evidence of translation equivalents. Also, in the construction of semantic networks, such as BalkaNet, which is being built using the WordNet methodology (Miller, 1990), they can be used to check semantic relations.
3. More specifically, this kind of resource can help to solve the problem of structural derivation by redefining a Serbian entry in dictionaries of the DELAS/DELAF type or by constructing appropriate finite automata. In this case French serves as a kind of meta-language with which one can try to encompass the differences in translation. For instance, the analysis of the aligned texts of Candide showed that four entries in the Serbian DELAS, namely (Engl. baron), (Engl. baroness), (Engl. baron's), (Engl. baroness's), correspond to one entry in the French DELAS (Engl. baron) (Vitas 2002).

2. Software for text alignment

Two basic approaches are used for automatic text alignment: statistical and structural. A well-known program using the statistical approach is Vanilla (Danielsson, 1997). This program works with texts segmented into at most two levels. These two levels are usually interpreted as paragraphs and sentences, but they can actually be marked by any other tags.
The algorithm is based on the assumption that two units that correspond to each other have approximately the same number of characters. As a consequence, it is not possible to align a structurally tagged text with a raw text. Structural tags are not taken into consideration during the alignment process. The program is simple to use and is available as open source. However, in some cases many corrections to the aligned texts have to be made by hand. A serious drawback of the algorithm is the prerequisite that both texts have the same number of higher-level units. The program is not supported by a concordancer, but one can be developed independently. One example of aligned units from the already mentioned novel by Verne is:

*** Link: 1 - 2 ***
En tout cas, il n'était prodigue de rien, mais non avare, car partout où il manquait un appoint pour une chose noble, utile ou généreuse, il l'apportait silencieusement et même anonymement. .EOS
U svakom slucyaju nije bio rasipnik, ali ni tvrdica. Gde god je nesxto trebalo za neku plemenitu, korisnu ili velikodusxnu stvar, on je davao cxutecxi i neopazxeno. .EOS

An XML tag is used in both the target and the source text to mark sentence elements. In this example, one source sentence is aligned with two target sentences, as indicated by the Link: 1 - 2 marker.

An example of the structural approach is the MLAlign program by Laurent Romary and Patrice Bonhomme. To use this program, the logical layout of the texts has to be XML tagged. The program maps the logical structure of the source text onto the logical structure of the target text. A concordancer that supports it has been developed. The problem is that not all of the available resources are XML tagged. Our experience shows that manual XML tagging is time consuming and biased by the human tagger, while automatic tagging is error prone.

3. The use of lexical resources for alignment

Various ideas have already been exploited in order to improve the alignment process (for instance, (Chan, 1993)), but none of them concentrates on selecting a subset of textual units that are, as a rule, literally translated. The idea of using lexical resources in alignment arose during our first experiments with the exploitation of the aligned texts of Candide. This experience showed that using Intex (Silberztein, 1993) on each of the texts in turn was more useful than using the sentence-aligned texts. The reason for this is simple: the concordances of both the original and the translation are usually consulted in the search for translation equivalents, and the lexical resources incorporated in Intex enabled a precise specification of search requirements for both languages.

The experience with aligned texts also suggested that certain text units are more often than not translated literally. In order to check this assumption, a simple experiment was undertaken using texts from the May 2001 issue of "Le Monde diplomatique", whose main topic is advertising. The initial assumption proved right for certain simple text units, such as dates and currencies. These elements can be described by graphs that are similar in the source and target languages. The distribution of some other lexical units for which simple graphs can be constructed for both languages showed similar behaviour on the same text samples. Such units are, for instance, toponyms and proper names. Besides these self-evident sets of literally translated text units, another set of this kind emerged: it contains those units that Intex identifies as "unknown words", where one usually finds trademarks, various acronyms, etc.
In order to identify their occurrences in both the original and the translated text, it is sufficient to construct a graph that recognizes units belonging to the intersection of the sets of unknown words in the original and in its translation. Those units are not translated; rather, they are transferred to the translated text. For instance:

An excerpt from the concordance of "unknown words" in the French version of "Le Monde diplomatique":

charisme et en évocation subversive. Benetton assimilera son nom de marque à
grandes étiquettes de disques comme BMG envoient désormais des «équipes de
roupes (Axa, la Société générale, la BNP, les AGF, les géants de la vente
nchise sur la rébellion adolescente, Body Shop disposera de la compassion,
adelphie ou Chicago, ils disent «Eh, bro [frère], regarde-moi les baskets»,
e Nike, décrivait sa conversion à le bro-ing à Harlem: «On est allés à le
rme pour désigner cette pratique: le bro-ing. Cette expression vient du fait
phrase: «The Gore Prescription Plan: Bureaucrats Decide.» Puis, sur fond
eau de vérification de la publicité (BVP), organisme émanant des annonceurs,

The corresponding excerpt of "unknown words" from the Serbian version:

yki nastrojenoj reklamnoj industriji. Benetton se poistovecxuje sa borbom
elike diskorgrafske kucxe kao sxto je BMG sxalju danas "ulicyne ekipe"
grupe („Aksa”, „Sosiete Zxeneral”, „BNP”, „AGF” i giganti prodaje putem
, Pepsi je znak mladalacyke pobune, Body Shop saosecxanja, Reebock
nastupajucxi ovim recyima: "Hi, bro /brother/! Pogledaj ove teniske!",
Nike Aron Kuper ovako je opisao svoj bro-ing sistem u Harlemu: "Otisxli,
da je za to stvorila vlastiti termin: bro-ing. Taj izraz je nastao u
a, ecyenica: The Gore Prescription Plan: Bureaucrats Decide (Gorov plan
nagona za kupovinom. U oktobru 1999. BVP, kojim dominiraju interesi

The density of some of these literally translated lexical units in the original text is illustrated in Figure 1.
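The intersection of unknown-word sets described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the Intex graph itself: the small sets `src_known` and `tgt_known` are toy stand-ins for the French and Serbian dictionaries, and the function names are hypothetical.

```python
import re

def tokens(text):
    """Return the set of word-like tokens (letters/digits, inner hyphens kept)."""
    return set(re.findall(r"[^\W_]+(?:-[^\W_]+)*", text))

def unknown_words(text, known):
    """Tokens that the (toy) lexicon does not contain, compared in lower case."""
    return {t for t in tokens(text) if t.lower() not in known}

def transferred_units(src_text, tgt_text, src_known, tgt_known):
    """Units unknown in both texts: candidates for items carried over untranslated."""
    return unknown_words(src_text, src_known) & unknown_words(tgt_text, tgt_known)

# Fragments taken from the concordance excerpts above.
src = "grandes étiquettes de disques comme BMG envoient désormais"
tgt = "diskorgrafske kucxe kao sxto je BMG sxalju danas"

# Hypothetical miniature lexicons standing in for the real dictionaries.
src_known = {"grandes", "étiquettes", "de", "disques", "comme", "envoient", "désormais"}
tgt_known = {"diskorgrafske", "kucxe", "kao", "sxto", "je", "sxalju", "danas"}

print(transferred_units(src, tgt, src_known, tgt_known))  # {'BMG'}
```

With real dictionaries the same intersection would also keep items such as bro-ing, which appears unchanged in both excerpts, while transliterated names (Axa versus Aksa) would drop out and need a separate treatment.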
The density of literally translated lexical elements shows that they can serve as reliable 'anchors' for the corresponding segments in the original text and its translation. In principle, using the elements recognized by Intex, it is possible to pre-edit the text for aligners of the Vanilla type (using the paragraph tags, if they exist, and the {S} tags inserted by Intex for sentences), as well as for MLAlign (XML/SGML output from the FST that recognizes the literally translated lexical units).

Figure 1. The frequency of units representing proper names in the original French sample of "Le Monde diplomatique"

On the other hand, the alignment process can be seen as a generalization of the bootstrapping method proposed in (Gross, M. 2000), where local grammars are developed step by step in order to cover specific meanings of keywords in concordances. Here, the bootstrapping method is applied in order to incorporate as many corresponding tags in the source and target text as are necessary to cover the text with anchors of appropriate density.

Finally, this process of lexical recognition enables XML tagging of both the original and the translation with tags, each carrying a name and an attribute value, that enclose the recognized textual unit. For instance, the same tag is inserted in the French text (33 degrés Celsius) and in the Serbian text (Trideset i tri stepena Celziusa), where the attribute value represents the "canonic" value of the recognized sequences obtained as the output of the corresponding transducer. The information obtained from such tags is twofold: the tag name represents the type of the recognized lexical unit, while the attribute value enables the comparison of tags in the source and target text. Only tags with the same name and the same attribute value can be potential anchors.

Figure 2. The occurrences of all forms of the proper names Buvar and Pekiše in the Serbian translation
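The anchor condition stated above (same tag name and same attribute value) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `Tag` record, the tag names ("temperature", "date"), the canonical value formats, and the sentence indices are all assumptions made up for the example.

```python
from collections import namedtuple

# Hypothetical record for a recognized unit: tag name (type of unit),
# canonical attribute value, and index of the sentence that contains it.
Tag = namedtuple("Tag", "name value sentence")

def potential_anchors(src_tags, tgt_tags):
    """Pair source and target tags that share both name and canonical value."""
    by_key = {}
    for t in tgt_tags:
        by_key.setdefault((t.name, t.value), []).append(t)
    return [(s, t) for s in src_tags for t in by_key.get((s.name, s.value), [])]

# "33 degrés Celsius" and "Trideset i tri stepena Celziusa" are assumed to
# receive the same canonical value from their respective transducers.
src_tags = [Tag("temperature", "33C", 4), Tag("date", "1999-10", 7)]
tgt_tags = [Tag("temperature", "33C", 5), Tag("date", "2001-05", 9)]

for s, t in potential_anchors(src_tags, tgt_tags):
    print(f"{s.name}={s.value}: source sentence {s.sentence} ~ target sentence {t.sentence}")
```

Only the temperature tags pair up here; the two date tags share a name but not a value, so they are not proposed as an anchor.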

Publication date: 2005